feat: multi-model adversarial harnesses - structural hardening - security sandbox - battle-tested results #3
Open
lliWcWill wants to merge 8 commits into coleam00:main from
Conversation
1. Iterative negotiation: negotiateContract now runs up to 3 rounds of generator→evaluator back-and-forth instead of a single pass. The generator counter-proposes based on evaluator feedback until APPROVED.
2. Fail closed on malformed contracts: parseContract throws instead of silently falling back to a generic 3-criterion default. The caller retries negotiation up to 2 times before propagating the error.
3. Renegotiate on bad criteria: the retry loop now detects when all criteria are failing (avg score < 4, or all below threshold) and triggers contract renegotiation mid-sprint instead of burning retries against impossible criteria.

Applied to both claude-harness and codex-harness.
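The iterative round loop can be sketched roughly as follows. This is a minimal stand-in, not the harness's actual code: `Contract`, `propose`, and `review` are hypothetical shapes for the real generator and evaluator agent calls.

```typescript
// Hypothetical sketch of the 3-round negotiation loop; the real harness
// wires `propose`/`review` to Claude and the evaluator model.
type Contract = { criteria: string[] };

async function negotiateContract(
  propose: (feedback: string | null) => Promise<Contract>,
  review: (c: Contract) => Promise<{ approved: boolean; feedback: string }>,
  maxRounds = 3,
): Promise<Contract> {
  let feedback: string | null = null;
  let contract: Contract | undefined;
  for (let round = 1; round <= maxRounds; round++) {
    contract = await propose(feedback);      // generator (counter-)proposes
    const verdict = await review(contract);  // evaluator critiques in fresh context
    if (verdict.approved) return contract;   // early exit on APPROVED
    feedback = verdict.feedback;             // feed critique into the next round
  }
  return contract!;                          // fall back to the last proposal
}
```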
CodeRabbit review caught two issues:
1. Empty feedback array → division by zero → NaN avgScore.
2. The allFailing=true branch logged but never renegotiated (the try block was inside the else-if only).

Fix: add a feedback.length > 0 guard, and restructure so the outer condition gates renegotiation, with an inner if/else for accurate log messages.
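The fixed trigger logic looks roughly like this. `CriterionFeedback` and `shouldRenegotiate` are illustrative names, not the harness's real API; the threshold of 4 comes from the commit message above.

```typescript
// Sketch of the renegotiation trigger with both CodeRabbit fixes applied.
interface CriterionFeedback { score: number; passed: boolean }

function shouldRenegotiate(
  feedback: CriterionFeedback[],
  avgThreshold = 4,
): { renegotiate: boolean; reason: string } {
  // Guard first: [].every() is vacuously true and 0/0 is NaN, so an
  // empty array must bail out before either check runs.
  if (feedback.length === 0) {
    return { renegotiate: false, reason: "no feedback yet" };
  }
  const avg = feedback.reduce((s, f) => s + f.score, 0) / feedback.length;
  const allFailing = feedback.every((f) => !f.passed);
  // Outer condition gates renegotiation; the inner branch only picks the log message.
  if (avg < avgThreshold || allFailing) {
    const reason = allFailing
      ? "every criterion is failing"
      : `avg score ${avg} below ${avgThreshold}`;
    return { renegotiate: true, reason };
  }
  return { renegotiate: false, reason: "criteria look achievable" };
}
```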
…ation logger

New mixed-harness/ — cross-model adversarial dev inspired by GAN architecture:
- Generator (Claude Opus 4.6) builds code against sprint contracts
- Evaluator (Codex GPT-5.4) rips apart the work in fresh context
- Zero sycophancy: the evaluator has no emotional investment in the code

Includes all 3 hardening fixes from the parent harnesses:
- Iterative contract negotiation (3 rounds)
- Fail-closed contract parsing (throws on garbage)
- Mid-sprint renegotiation when all criteria fail

ConversationLogger (shared/conversation-logger.ts):
- Captures every agent prompt, response, tool call, score, and error
- Saves as Obsidian-friendly markdown (.md) + machine-readable JSONL
- Default output: agent-brain-vault/Projects/brane-code/debates/
- Collapsible tool calls, score badges, duration tracking

Tests (29 passing):
- parseContract: 9 tests (fail-closed, code blocks, garbage rejection)
- Renegotiation trigger: 7 tests (thresholds, division-by-zero guard)
- parseEvalResult: 3 tests (threshold recalculation, extraction)
- Negotiation rounds: 3 tests (early approval, max rounds)
- ConversationLogger: 7 tests (entries, markdown, JSONL, disk save)
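The logger's dual-output shape can be sketched as below. Field names (`role`, `content`, `score`) and method names are guesses based on this description, not the actual shared/conversation-logger.ts API; note the empty-JSONL behavior matches a later fix in this PR.

```typescript
// Rough shape of ConversationLogger's two serializations.
type Entry = { role: string; content: string; score?: number };

class ConversationLogger {
  private entries: Entry[] = [];

  log(e: Entry): void {
    this.entries.push(e);
  }

  // Machine-readable: one JSON object per line. An empty log yields ""
  // rather than a bare newline.
  toJSONL(): string {
    return this.entries.map((e) => JSON.stringify(e)).join("\n");
  }

  // Obsidian-friendly markdown with a score badge when a score is present.
  toMarkdown(): string {
    return this.entries
      .map((e) => {
        const badge = e.score !== undefined ? ` (score: ${e.score})` : "";
        return `### ${e.role}${badge}\n\n${e.content}`;
      })
      .join("\n\n");
  }
}
```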
New gemini-harness/ — cross-company adversarial dev:
- Generator: Claude Opus 4.6 (Anthropic) builds code via the Agent SDK
- Evaluator: Gemini 3.1 Pro Preview (Google) rips it apart via @google/genai
- Gemini evaluator has tool calling: readFile, runCommand, listFiles
- Multi-turn chat loop handles tool calls until Gemini is done evaluating
- 1M context on BOTH sides — true heavyweight matchup

Same 3 hardening fixes as the other harnesses:
- Iterative contract negotiation (3 rounds)
- Fail-closed contract parsing
- Mid-sprint renegotiation on bad criteria

Includes ConversationLogger integration for full transcript logging.

SDK: @google/genai@1.48.0
Model: gemini-3.1-pro-preview (1M input, 65K output)
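The multi-turn evaluation loop can be sketched schematically as below. The `Turn`/`ToolCall` shapes and the `send` function are deliberately simplified stand-ins, not the real @google/genai chat types: the point is only the control flow of answering tool calls until the model emits plain text.

```typescript
// Schematic of the multi-turn loop: keep servicing tool calls until the
// evaluator stops asking for tools and returns its verdict as text.
type ToolName = "readFile" | "runCommand" | "listFiles";
type ToolCall = { name: ToolName; args: Record<string, string> };
type Turn = { text?: string; toolCalls?: ToolCall[] };

async function runEvaluation(
  send: (input: string) => Promise<Turn>,
  tools: Record<ToolName, (args: Record<string, string>) => string>,
): Promise<string> {
  let turn = await send("Evaluate the sprint output against the contract.");
  while (turn.toolCalls?.length) {
    // Execute each requested tool and feed the results back as the next turn.
    const results = turn.toolCalls.map((c) => tools[c.name](c.args));
    turn = await send(JSON.stringify(results));
  }
  return turn.text ?? "";
}
```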
- Gemini evaluator: sandbox tool handlers with path confinement, command allowlisting (execFileSync instead of execSync), and fs.readdir instead of shell-interpolated find
- Fix fragile "APPROVED" exact-match in all 4 harness negotiation loops — now a case-insensitive startsWith
- Remove personal vault path from logDir defaults (use ./logs)
- Remove brane-streaming-fix.md (task doc, not project code)
- Remove unused imports in mixed/gemini harnesses
- Fix copy-paste comment ("Codex" -> "Gemini" in gemini harness)
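The allowlisted command runner described above can be sketched like this. The allowlist contents and the `runCommand` signature are illustrative, not the harness's exact code; the key point is that execFileSync takes an argv array and never invokes a shell.

```typescript
// Sketch of the sandboxed runner: unknown binaries are rejected up front,
// and execFileSync means no shell interpretation of pipes, globs, or $(...).
import { execFileSync } from "node:child_process";

const ALLOWED = new Set(["ls", "cat", "grep", "wc"]); // illustrative set

function runCommand(cmd: string, args: string[], cwd: string): string {
  if (!ALLOWED.has(cmd)) {
    throw new Error(`command not allowed: ${cmd}`);
  }
  // execFileSync passes args directly to the binary; nothing is
  // string-interpolated through /bin/sh.
  return execFileSync(cmd, args, { cwd, encoding: "utf8", timeout: 10_000 });
}
```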
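The hardened verdict check is small enough to show in full; this is a sketch of the pattern named above (case-insensitive startsWith), with `isApproved` as a hypothetical helper name.

```typescript
// Tolerate case, leading whitespace, and trailing prose instead of
// requiring the evaluator's reply to equal "APPROVED" exactly.
function isApproved(reply: string): boolean {
  return reply.trim().toLowerCase().startsWith("approved");
}
```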
- Gemini sandbox: use realpath() instead of resolve() to prevent symlink-based path traversal escapes
- Gemini sandbox: reject runCommand args with absolute paths outside the workspace (prevents `cat /etc/passwd`, `grep -r secret /etc`)
- All evaluators: guard against the empty feedback array where [].every() returns true — prevents a silent false pass
- ConversationLogger: empty JSONL returns "" not a bare newline
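The symlink-safe confinement check can be sketched as below. `confinePath` and `workspaceRoot` are illustrative names; the substance is that realpathSync resolves symlinks before the prefix check, which plain resolve() does not.

```typescript
// Path confinement sketch: a symlink inside the workspace cannot redirect
// reads outside it, because both sides are canonicalized first.
import { realpathSync } from "node:fs";
import { resolve, sep } from "node:path";

function confinePath(workspaceRoot: string, requested: string): string {
  const root = realpathSync(workspaceRoot);
  // realpathSync follows symlinks; resolve() alone only normalizes the string.
  const real = realpathSync(resolve(root, requested));
  if (real !== root && !real.startsWith(root + sep)) {
    throw new Error(`path escapes workspace: ${requested}`);
  }
  return real;
}
```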
Security (Gemini evaluator sandbox):
- Remove node/npm/npx/bun/bunx from the runCommand allowlist — prevents arbitrary code execution via `node -e "..."`
- Restrict git to read-only subcommands (log, status, diff, show, ls-files, rev-parse) — prevents data exfiltration via git push
- Block find -exec/-execdir/-delete flags — prevents subprocess spawning
- Keep path containment (realpath + absolute path rejection) from the previous commit
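A per-binary argument policy matching the bullets above can be sketched like this. The subcommand and flag sets mirror the commit message; `checkArgs` itself is a hypothetical helper, not the harness's real function.

```typescript
// Illustrative argument policy: git is limited to read-only subcommands,
// and find may not spawn subprocesses or delete files.
const GIT_READONLY = new Set(["log", "status", "diff", "show", "ls-files", "rev-parse"]);
const FIND_BLOCKED = new Set(["-exec", "-execdir", "-delete"]);

function checkArgs(cmd: string, args: string[]): void {
  if (cmd === "git" && !GIT_READONLY.has(args[0] ?? "")) {
    throw new Error(`git subcommand not allowed: ${args[0]}`);
  }
  if (cmd === "find" && args.some((a) => FIND_BLOCKED.has(a))) {
    throw new Error("find may not spawn subprocesses or delete files");
  }
}
```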
Polish:
- Fix wrong evaluator name in the gemini generator prompt ("Codex" -> generic)
- Remove unused readContract import from claude-harness
- Fix README: wrong default model (sonnet -> opus); add mixed/gemini harness sections to Quick Start and Project Structure
RESULTS.md — full scoreboard, key findings, and analysis from 4 harness runs (Claude 5/5, Codex 0/1, Mixed 11/13 on S4, Gemini 5/5).

examples/gemini-run-excerpt.md — first 150 lines of an actual Gemini evaluator conversation log, showing the ConversationLogger output format.
Summary
Results - first multi-model harness runs
Cross-model evaluation caught bugs that self-evaluation missed:
Structural fixes to harness logic
Gemini evaluator sandbox - 5 layers of defense
New files
Test plan